You’ve decided to open a small robot-run cafe in Los Angeles. The project is promising but expensive, so you and your partners decide to try to attract investors. They’re interested in the current market conditions—will you be able to maintain your success when the novelty of robot waiters wears off? You’re an analytics guru, so your partners have asked you to prepare some market research. You have open-source data on restaurants in LA.
Link to presentation https://drive.google.com/file/d/1TVUoAm0YTSeMQf9cgDr5X0IuSHaHDvxh/view?usp=sharing
!pip install usaddress
Collecting usaddress Downloading usaddress-0.5.10-py2.py3-none-any.whl (63 kB) Requirement already satisfied: future>=0.14 in c:\users\edeng\anaconda3\lib\site-packages (from usaddress) (0.18.2) Collecting python-crfsuite>=0.7 Downloading python_crfsuite-0.9.8-cp39-cp39-win_amd64.whl (158 kB) Collecting probableparsing Downloading probableparsing-0.0.1-py2.py3-none-any.whl (3.1 kB) Installing collected packages: python-crfsuite, probableparsing, usaddress Successfully installed probableparsing-0.0.1 python-crfsuite-0.9.8 usaddress-0.5.10
import pandas as pd
from plotly import graph_objects as go
import plotly.express as px
import usaddress
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
rest_data=pd.read_csv('/Users/edeng/Downloads/rest_data_us.csv')
rest_data.info()
#show nan values
print(rest_data[rest_data.isna().any(axis=1)])
#we have 3 rows with nan values in the chain column.
#As it is impossible to fill this up, I will drop them
rest_data=rest_data.dropna()
#checking duplicates
rest_data[rest_data.duplicated()]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9651 entries, 0 to 9650
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 9651 non-null int64
1 object_name 9651 non-null object
2 address 9651 non-null object
3 chain 9648 non-null object
4 object_type 9651 non-null object
5 number 9651 non-null int64
dtypes: int64(2), object(4)
memory usage: 452.5+ KB
id object_name address chain object_type \
7408 19194 TAQUERIA LOS 3 CARNALES 5000 E WHITTIER BLVD NaN Restaurant
7523 19309 JAMMIN JIMMY'S PIZZA 1641 FIRESTONE BLVD NaN Pizza
8648 20434 THE LEXINGTON THEATER 129 E 3RD ST NaN Restaurant
number
7408 14
7523 1
8648 35
| id | object_name | address | chain | object_type | number |
|---|
rest_prop=rest_data.groupby('object_type')['id'].count().reset_index()
rest_prop
fig = go.Figure(data=[go.Pie(labels=rest_prop['object_type'], values=rest_prop['id'], title='Establishment Proportions')])
fig.show()
The majority of our data contain info about Restaurant types, while the least of our data contain info about Bakery types.
chain_prop=rest_data.groupby('chain')['id'].count().reset_index()
chain_prop
fig = go.Figure(data=[go.Pie(labels=chain_prop['chain'], values=chain_prop['id'], title='Is it a chain?')])
fig.show()
62% of the data is non chain establishments, while 38% are chains.
typical_chain=rest_data.groupby(['chain', 'object_type'])['id'].count().reset_index()
typical_chain.columns=['chain', 'Establishment', 'establishment_count']
fig = px.bar(typical_chain, x='Establishment', y='establishment_count', color='chain', title='Which type of establishment is typically a chain?')
fig.update_xaxes(tickangle=45)
fig.show()
v=typical_chain.Establishment.value_counts()
typical_chain[~typical_chain.Establishment.isin(v.index[v.gt(1)])]
| chain | Establishment | establishment_count | |
|---|---|---|---|
| 5 | True | Bakery | 283 |
Bakery is the establishment type which is typically a chain, as it is the only establishment which has no instances where it isn't a chain.
chain_char=rest_data[rest_data['chain']==True]
chain_char=chain_char.groupby(['object_name', 'object_type'])['number'].agg(['sum','count']).reset_index()
chain_char.columns=['object_name', 'establishment', 'number_of_seats', 'number_of_establishments']
print('The average seats per establishment is:', chain_char['number_of_seats'].sum()/chain_char['number_of_establishments'].sum())
print('The mean number of seats per chain:', chain_char['number_of_seats'].mean())
print('The mean number of establishments per chain:', chain_char['number_of_establishments'].mean())
chain_char=chain_char.groupby('establishment')[['number_of_seats', 'number_of_establishments']].sum().reset_index()
fig = px.bar(chain_char, x='establishment', y='number_of_seats', color='number_of_establishments', title='Number of seats to number of establishments in chains')
fig.update_xaxes(tickangle=45)
fig.show()
chain_char['seats_average']=chain_char['number_of_seats']/chain_char['number_of_establishments']
fig = px.bar(chain_char, x='establishment', y='seats_average', title='Seats ratio per chain establishment')
fig.update_xaxes(tickangle=45)
fig.show()
The average seats per establishment is: 39.69423286180631 The mean number of seats per chain: 53.35137111517368 The mean number of establishments per chain: 1.3440585009140769
There's an average of 40 seats for each establishment which indicate that there are few establishment with a lot of seats as a characterization of chains. We can also see that there are not a lot of establishments per chain, while average seats per chain is high at 53 seats.
avg_seats=rest_data.groupby('object_type')['number'].mean().reset_index().sort_values('number', ascending=False)
fig = px.bar(avg_seats, x='object_type', y='number', labels={
"object_type": "Restaurant Type",
"number": "Average Number of Seats"
}, title='Average seats by restaurant type')
fig.show()
Restaurants has the greatest number of seats.
def cleaning(raw):
raw_address=usaddress.parse(raw)
PostType= [item for item in raw_address if 'StreetNamePostType' in item]
streetName=[item for item in raw_address if 'StreetName' in item]
streetName=dict(streetName)
if PostType:
if (len(streetName)) > 1:
streetName=' '.join(map(''.join, streetName.keys()))
return streetName + ' ' + PostType[0][0]
else:
try:
streetName=sorted(streetName.keys(), reverse=True)
return streetName[0] + ' ' + PostType[0][0]
except:
return np.nan
else:
if (len(streetName)) > 1:
streetName=' '.join(map(''.join, streetName.keys()))
return streetName
else:
try:
streetName=sorted(streetName.keys(), reverse=True)
return streetName[0]
except:
return np.nan
rest_data['street_name']=rest_data.address.apply(cleaning)
rest_data
| id | object_name | address | chain | object_type | number | street_name | |
|---|---|---|---|---|---|---|---|
| 0 | 11786 | HABITAT COFFEE SHOP | 3708 N EAGLE ROCK BLVD | False | Cafe | 26 | EAGLE ROCK BLVD |
| 1 | 11787 | REILLY'S | 100 WORLD WAY # 120 | False | Restaurant | 9 | WORLD WAY |
| 2 | 11788 | STREET CHURROS | 6801 HOLLYWOOD BLVD # 253 | False | Fast Food | 20 | HOLLYWOOD BLVD |
| 3 | 11789 | TRINITI ECHO PARK | 1814 W SUNSET BLVD | False | Restaurant | 22 | SUNSET BLVD |
| 4 | 11790 | POLLEN | 2100 ECHO PARK AVE | False | Restaurant | 20 | ECHO PARK AVE |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 9646 | 21432 | HALL OF JUSTICE | 217 W TEMPLE AVE | False | Restaurant | 122 | TEMPLE AVE |
| 9647 | 21433 | FIN-MELROSE | 5750 MELROSE AVE | False | Restaurant | 93 | MELROSE AVE |
| 9648 | 21434 | JUICY WINGZ | 6741 HOLLYWOOD BLVD | True | Fast Food | 15 | HOLLYWOOD BLVD |
| 9649 | 21435 | MEDIDATE COFFEE | 548 S SPRING ST STE 100 | False | Cafe | 6 | SPRING ST |
| 9650 | 21436 | CAFE SPROUTS | 1300 S SAN PEDRO ST STE 111 | True | Restaurant | 19 | SAN PEDRO ST |
9648 rows × 7 columns
#checking nan street names
print(rest_data[rest_data['street_name'].isna()].reset_index())
# filling the nan values
rest_data[(rest_data['address'].str.contains('1033 1/2 LOS ANGELES ST'))]=rest_data[(rest_data['address'].str.contains('1033 1/2 LOS ANGELES ST'))].apply(lambda x: x.fillna('LOS ANGELES ST'))
rest_data[(rest_data['address'].str.contains('OLVERA ST'))]=rest_data[(rest_data['address'].str.contains('OLVERA ST'))].apply(lambda x: x.fillna('OLVERA ST'))
rest_data[rest_data['street_name'].isna()]
index id object_name address chain \ 0 1888 13674 MAGDALENO DEBBIE OLVERA ST E17 False 1 3719 15505 LAS ANITAS OLVERA ST 26 False 2 7003 18789 D.K. DONUTS 1033 1/2 LOS ANGELES ST False 3 7281 19067 CIELITO LINDO OLVERA ST 23 False 4 7319 19105 JUANITA'S CAFE OLVERA ST 20 False 5 7321 19107 KITTY'S SNOW CONES OLVERA ST 2 False 6 7438 19224 LA GOLONDRINA OLVERA ST W17 False 7 7741 19527 RODOLFO'S RESTAURANT OLVERA ST 5 True object_type number street_name 0 Restaurant 15 NaN 1 Restaurant 38 NaN 2 Fast Food 8 NaN 3 Restaurant 2 NaN 4 Restaurant 12 NaN 5 Restaurant 11 NaN 6 Restaurant 204 NaN 7 Restaurant 4 NaN
| id | object_name | address | chain | object_type | number | street_name |
|---|
top_10=rest_data.groupby('street_name')['id'].count().reset_index().sort_values('id', ascending=False).head(10)
top_10
| street_name | id | |
|---|---|---|
| 475 | SUNSET BLVD | 403 |
| 537 | WILSHIRE BLVD | 397 |
| 411 | PICO BLVD | 370 |
| 523 | WESTERN AVE | 367 |
| 211 | FIGUEROA ST | 332 |
| 393 | OLYMPIC BLVD | 309 |
| 502 | VERMONT AVE | 286 |
| 442 | SANTA MONICA BLVD | 264 |
| 27 | 3RD ST | 260 |
| 261 | HOLLYWOOD BLVD | 253 |
fig = px.bar(top_10, x='street_name', y='id', labels={
"street_name": "Street Name",
"id": "Number of Restaurants"
}, title='Number of restaurants by street')
fig.show()
Sunset BLVD has the highest number of restaurants at close to 400, followed by Wilshire BLVD and Pico BLVD, with 397 and 370 restaurants respectively. Number 10 on the list is Hollywood BLVD with 253 restaurants.
only_one=rest_data.groupby('street_name')['id'].count().reset_index()
only_one=only_one[only_one['id']==1].count()
only_one
street_name 250 id 250 dtype: int64
There are 250 street with only 1 restaurant.
top_rest=rest_data.groupby('street_name')['number'].agg(['sum', 'count']).reset_index().sort_values('count', ascending=False).head(10)
top_rest.columns=['street_name', 'seats_number', 'restaurants_number']
top_rest
fig = px.scatter(top_rest, x="restaurants_number", y="seats_number", color='street_name', size='restaurants_number')
fig.show()
sns.pairplot(top_rest);
From the above we can clearly see a positive correlation in the number of the restaurant with the number of seats. The higher the number of restaurants in the street, there is an increase in the number of seats.
#box plot
top_rest.set_index('street_name', inplace=True)
#creating a function that will show the names of the streets on the outliers
def make_labels(ax, boxplot):
fly = boxplot['fliers'][0]
med = boxplot['medians'][0]
# The x position of the median line
xpos = med.get_xdata()
# Lets make the text have a horizontal offset which is some
# fraction of the width of the box
xoff = 0.10 * (xpos[1] - xpos[0])
# The x position of the labels
xlabel = xpos[1] + xoff
# The median is the y-position of the median line
median = med.get_ydata()[1]
for flier in fly.get_ydata():
ax.text(1 + xoff, flier,
'{}'.format(top_rest.index.where(top_rest.seats_number==flier).dropna()[0]), va='center')
red_diamond = dict(markerfacecolor='r', marker='D')
fig, ax = plt.subplots()
ax.set_title('Seats number boxplot')
# Create the boxplot and store the resulting python dictionary
my_boxes = ax.boxplot(top_rest['seats_number'], flierprops=red_diamond)
# Call the function to make labels
make_labels(ax, my_boxes)
plt.xlabel('Seats Number')
plt.show()
We can see the street names where the number of seats are outliers in our data.
In conclusion, I inspected the data and viewed 3 Nan values in the chain field, which I removed as they were irrelevant to my task. There were no duplicates.
I viewed that 75% of the data is comprised of restaurants and 11% of fast food establishments, in addition 61% of the data is related to chains.
I found that bakeries in the data is only appearing as a chain. In addition I saw that chains tends to have fewer establishments with many seats, with restaurants having the highest average number of seats.
Lastly I inspected the 10 top streets in terms of establismnets count, and also inspected that 250 streets have 1 restaurant, and found that there is a positive correlation between the number of establishments on the street, and the number of seats.
Cafe establishments have 25 seats on average and consist of about 50% chains.
Per the above conclusion it may be recommended to open an average size cafe (25 seats) in a location which is not very packed with restaurants, but also don't have only 1. I recommed looking to open on streets which appear on the top 10 streets but also do not appear as an 3rd quarter outlier on the boxplot above, as it may indicate of overcrowding.